Human resources are the most valuable asset in any country¹. They are the main reason behind the success or failure of any organization. In fact, an educated and competent workforce is the key driver of economic and social development. In this context, the importance of academic education has become undeniable. It is therefore crucial to invest money and time in studying students’ academic performance and identifying effective ways to improve it.
Given its importance, the topic has received particular attention in past research, and many studies have analyzed the factors affecting students’ academic performance. While some focused on psychological variables, such as Franck Amadieu and André Tricot’s research², others have examined the impact of elements such as mobility³, gender and other socio-economic factors on students’ academic success.
Several reasons motivated our choice of this research topic. As students, we are passionate about the educational field, and we want this project to provide a detailed analysis that can serve as a reference for leaders working in education. Above all, we want to help schools and universities better understand the factors influencing students’ academic performance, so that they can improve their decision-making processes, their students’ success rate and, ultimately, their overall organization.
Sources: ¹ Jean-Marie Peretti, Gestion des ressources humaines, 2004. ² Psychological factors which have an effect on student success, 2015. ³ La migration pour études : regards d’intervenants sur l’accueil et l’intégration des nouveaux étudiants, 2009.
The aim of the project is to understand the evolution of secondary academic performance in France. Our study will mainly focus on students in troisième, the final year of collège (equivalent to 11th grade in Switzerland), and their results on the Diplôme National du Brevet (DNB) by school.
First, we will observe whether DNB admissions have improved or, on the contrary, deteriorated over the years. From this dataset, we will also make comparisons, particularly at the geographical level, and analyze the success rate in terms of distinction for each school.
Then, we will try to determine whether there is a correlation between academic success and certain socio-economic factors, such as the type of accommodation, the rate of single-parent families, and the involvement of schools in students’ physical and sports practice. Finally, beyond these factors, we will investigate whether the COVID-19 pandemic has had a direct negative impact on students’ school performance.
What is the evolution of student performance over time and across the different regions/departments of France?
Do socio-economic factors such as the type of accommodation, family situation or school policies have an influence on student success?
Has the COVID-19 pandemic impacted student performance?
This dataset presents the results of the “diplôme national du brevet” by school, for schools in metropolitan France and for the overseas departments and regions. This data set contains 139’580 observations.
| Variable | Meaning |
|---|---|
| session | Year of the exam session |
| school_id | School identification number |
| school_type | School type divided into six categories: COLLEGE, LYCEE PROFESSIONNEL, LYCEE, EREA, CFA, and AUTRE |
| establishment_name | Name of the establishment |
| education_sector | Education sector categorised as public or private |
| municipality_code | Municipality code |
| municipality | Name of the municipality |
| department_code | Department code. Note that France has 101 departments. |
| department | Name of the department |
| academy_code | Academy code |
| academy_name | Name of the academy |
| region_code | Region code. Note that France has 18 administrative regions. |
| region | Name of the region |
| registered | Registered candidates |
| present | Candidates present for the exam |
| admitted | Candidates admitted |
| admitted_without | Candidates admitted without distinction |
| admitted_AB | Candidates admitted with distinction “Assez Bien” |
| admitted_B | Candidates admitted with distinction “Bien” |
| admitted_TB | Candidates admitted with distinction “Très bien” |
| success_rate | Success rate, [Admitted]/[Present], as a percentage |
This data set gathers all schools which have been awarded the “Generation 2024” label. The objective of this label, developed in view of the Paris 2024 Olympic Games, is to develop bridges between the school world and the sports movement in order to encourage young people to take part in physical activity and sport. This data set contains 6’883 observations.
| Variable | Meaning |
|---|---|
| region | Name of the region |
| academy | Name of the academy |
| department | Name of the department |
| municipality | Name of the municipality |
| establishment | Name of the establishment |
| school_id | School identification number |
| school_type | School type |
| education_sector | Education sector categorised as public or private |
| postcode | Postcode |
| adress | Address of the establishment |
| adress_2 | Additional address of the establishment |
| mail | E-mail address of the establishment |
| students | Number of students in the school |
| priority_education | Indicates whether the school is located in a priority education network (REP) or a reinforced priority education network (REP+) |
| city_school | Indicates whether the school is part of a city school |
| QPV | Position relative to a priority neighbourhood of the city policy. It is a policy aimed at compensating for differences in living standards with the rest of the territory. |
| ULIS | Indicates whether the school offers a ULIS (Localized Unit for School Inclusion) |
| SEGPA | Indicates whether the school has a SEGPA (adapted general and vocational education section) |
| sport_section | Indicates whether the school has a sports section |
| agricultural_high_school | Indicates whether the school is an agricultural high school |
| military_high_school | Indicates whether the school is a military high school |
| vocational_high_school | Indicates whether the establishment is labeled “vocational high school” |
| establishment_web | Url of the description of the establishment page on the ONISEP website |
| SIREN_SIRET | SIREN/SIRET number of the establishment (French business register identification numbers) |
| district | Name of the district to which the school is attached |
| ministry | Ministry responsible for the institution |
| label_start_date | Start date of the “generation 2024” label. Format yyyy/mm/dd |
| label_end_date | End date of the “generation 2024” label. Format yyyy/mm/dd |
| y_coordinate | Y coordinate of the establishment, using the EPSG coordinate system |
| x_coordinate | X coordinate of the establishment, using the EPSG coordinate system |
| epsg | EPSG code of the coordinate system used to locate the establishment |
| precision_on_localisation | Specification of the geographical location of the establishment |
| latitude | Latitude |
| longitude | Longitude |
| position | Geographical position |
| engaging_30_sport | Indicates whether the institution participates in the 30 minutes of daily physical activity programme |
This dataset records enrolment in secondary schools according to the type of accommodation for pupils: half-board, boarding school etc. This data set contains 32’096 observations.
| Variable | Meaning |
|---|---|
| year_back_to_school | Year of the start of the school year |
| region | Name of the academic region |
| academy | Name of the academy |
| department | Name of the department |
| municipality | Name of the municipality |
| number | School identification number |
| establishment_main_name | Main name of the establishment |
| establishment_name | Name of the establishment |
| school_type | School type |
| education_sector | Education sector categorised as public or private |
| students_secondary_education | Students in secondary education |
| students_higher_education | Number of students in higher education |
| external_students_secondary_education | External students in secondary education |
| half_boarders_students_secondary_education | Half-boarders in secondary education |
| boarding_students_secondary_education | Boarding students in secondary education |
| external_students_higher_education | External students in higher education |
| half_board_students_higher_education | Half-board students in higher education |
| boarding_students_higher_education | Boarding students in higher education |
This data set provides information about the number of single-parent families in each department. The census has been carried out over two consecutive years, every five years, since 2007. This data set contains 606 observations.
| Variable | Meaning |
|---|---|
| geocode | Departmental code from INSEE |
| libgeo | Name of department |
| year | Census year |
| sing_par | Number of single-parent families |
This is a time-based data set with information on the COVID-19 tests, and their results, carried out by laboratories, hospitals, pharmacists, doctors and nurses. It is updated daily; as of 30 November 2022, it contained 82’394 observations.
| Variable | Meaning |
|---|---|
| department_code | Department code |
| test_week | Week of the tests. Format yyyy-Sww |
| educational_level | Age group as an interval [m-n], m and n being the lower and upper limits |
| age_group | Name of the age group; n-1 is used, except for the oldest group, where 18 is used |
| pop | Population |
| positive | Weekly patients testing positive |
| tested | Weekly patients tested |
| incidence_rate | Incidence rate |
| positivity_rate | Positivity rate |
| screening_rate | Screening rate |
Loading of the data

All data sets except single_parent are CSV files with semicolons as separators. The single_parent data set is an Excel file, so we have to use read_excel. We use skip = 4 because the document includes extra header rows.
DNB_par_etablissement <- read_delim(here::here("data/DNB-par-etablissement.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
Etablissements_labellises_generation_2024 <- read_delim(here::here("data/Etablissements-labellises-generation-2024.csv"),";", escape_double = FALSE, trim_ws = TRUE)
Hebergement_eleves_etablissements_2d <- read_delim(here::here("data/Hebergement-eleves-etablissements-2d.csv"), ";", escape_double = FALSE, trim_ws = TRUE)
insee_rp_hist_xxxx <- read_excel(here::here("data/insee_rp_hist_xxxx.xlsx"), skip = 4)
covid_sp_dep_heb_cage_scol_2022_11_30_19h01 <- read_delim(here::here("data/covid_sp_dep_heb_cage_scol_2022_11_30_19h01.csv"), ";", escape_double = FALSE, trim_ws = TRUE)

We realised that some wrangling is necessary for each data set, so we established a checklist that we go through for each of them.
# Rename the columns of a data frame, checking that the name vector fits
rename_df <- function(df, x){
  if (ncol(df) == length(x)){
    names(df) <- x
    as_tibble(df)
  } else {
    stop("Vector is not the right length")
  }
}
dnb_colnames <- c("session", "school_id", "school_type", "establishment_name", "education_sector", "municipality_code", "municipality", "department_code", "department", "academy_code", "academy_name", "region_code", "region", "registered", "present", "admitted", "admitted_without", "admitted_AB", "admitted_B", "admitted_TB", "success_rate_pct"
)
dnb_results <- rename_df(DNB_par_etablissement, dnb_colnames)

The success rate is read as a character string of the form xx,xx%; we want it as a double of the form xx.xx.
dnb_results[["success_rate_pct"]] <- as.double(gsub("%","",
gsub(",",".", dnb_results[["success_rate_pct"]])))

We create a new variable department_fr and drop the overseas collectivities (COM).
dnb_results$department_fr <- stri_trans_general(dnb_results$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
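The same department-name cleaning recurs for each data set below. As a sketch (our suggestion, not code from the pipeline above), it could be factored into a helper function so that each data set is cleaned identically:

```r
library(stringi)
library(stringr)

# Hypothetical helper wrapping the recurring department-name cleaning
normalize_department <- function(x) {
  x <- stri_trans_general(x, "Latin-ASCII")  # strip accents
  x <- str_to_title(x)
  x <- gsub("Du", "du", x)
  x <- gsub("De", "de", x)
  x <- gsub("D'", "D", x)
  x <- gsub("Et", "et", x)
  x <- gsub(" ", "-", x)
  x <- str_replace_all(x, "Corse-du-Sud", "Corse du Sud")
  x <- str_replace_all(x, "deux-Sevres", "Deux-Sevres")
  x <- str_replace_all(x, "Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence")
  x <- str_replace_all(x, "Territoire-de-Belfort", "Territoire de Belfort")
  x <- str_replace_all(x, "Seine-Saint-denis", "Seine-Saint-Denis")
  x
}

normalize_department("SEINE-SAINT-DENIS")  # "Seine-Saint-Denis"
```

This would also make it harder for one data set to silently miss a fix (as happens with Alpes-de-Hte-Provence in one of the pipelines below).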
dnb_results <- dnb_results %>%
dplyr::filter(!department_fr %in% c("Polynesie-Francaise", "Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "NA", "-"))

dnb_results <- dnb_results %>%
mutate(without_pct = admitted_without/admitted*100,
AB_pct = admitted_AB/admitted*100,
B_pct = admitted_B/admitted*100,
TB_pct = admitted_TB/admitted*100
)

dnb_results_dep <- dnb_results %>%
select(session, department_code, region:TB_pct) %>%
group_by(department_fr, session, region) %>%
summarise(registered = sum(registered),
present = sum(present),
admitted = sum(admitted),
admitted_without = sum(admitted_without),
admitted_AB = sum(admitted_AB),
admitted_B = sum(admitted_B),
admitted_TB = sum(admitted_TB),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE))

We can see the summarized data set dnb_results_dep below.
est_24_names <- c("region", "academy", "department", "municipality", "establishment", "school_id", "school_type", "education_sector", "postcode", "adress", "adress_2", "mail", "students", "priority_education", "city_school", "QPV", "ULIS", "SEGPA", "sport_section", "agricultural_high_school", "military_high_school", "vocational_high_school", "establishment_web", "SIREN_SIRET", "district", "ministry", "label_start_date", "label_end_date", "y_coordinate", "x_coordinate", "epsg", "precision_on_localisation", "latitude", "longitude", "position", "engaging_30_sport")
establishment_24 <- rename_df(Etablissements_labellises_generation_2024, est_24_names)
establishment_24 <- establishment_24 %>%
mutate(session_started = case_when(month(label_start_date) <= 7 ~ year(label_start_date),
month(label_start_date) > 7 ~ year(label_start_date)+1),
session_ended = case_when(month(label_end_date) <= 7 ~ year(label_end_date),
month(label_end_date) > 7 ~ year(label_end_date)+1)
)

As for the DNB results, we create department_fr.
establishment_24$department_fr <- stri_trans_general(establishment_24$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")

We can see on the map below that the data set contains establishments from the overseas collectivities (COM) as well as French international schools.
As previously discussed, we decided to keep only data from mainland France. We had to make sure that the French international schools were also removed, and we took the opportunity to drop unused variables.
establishment_24 <- establishment_24 %>%
dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>%
dplyr::filter(!department_fr == "NA") # filtering "NA" and "-" makes sure that no international schools remain
# establishment_24 has many variables which we will certainly not use
establishment_24 <- establishment_24 %>%
select(-c(postcode:mail, city_school, QPV:SEGPA, establishment_web:ministry, precision_on_localisation))

The “Diplôme National du Brevet” is the diploma received at the end of collège. We therefore keep only establishments that are collèges and drop all the rest.
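A minimal sketch of that filter on a toy data frame, assuming school_type uses the same COLLEGE coding as in the DNB table (hypothetical stand-in data, not the real establishment_24):

```r
library(dplyr)

# Toy stand-in for establishment_24: only school_type matters here
establishment_24_toy <- tibble(
  establishment = c("A", "B", "C"),
  school_type = c("COLLEGE", "LYCEE", "COLLEGE")
)

# Keep only colleges, as the DNB is sat at the end of collège
colleges <- establishment_24_toy %>%
  filter(school_type == "COLLEGE")
```

The same one-line filter applies to the real data set.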
Other data sets only go down to the department level, so we create a simplified version for later merges between data sets. We summarize by counting the number of establishments labelled each session in each department. The end of the label period is not of interest to us: the earliest end date, 9.01.2021, is categorised as session 2022, and we only have DNB results up to session 2021.
establishment_24_dep <- establishment_24 %>%
select(region, department_fr, session_started) %>%
group_by(department_fr, session_started) %>%
summarise(establishment = n())

We can see the summarized data set establishment_24_dep below.
housing_names <- c("year_back_to_school", "region", "academy", "department", "municipality", "school_id", "establishment_main_name", "establishment_name", "school_type", "education_sector", "students_secondary_education", "students_higher_education", "external_students_secondary_education", "half_boarders_students_secondary_education", "boarding_students_secondary_education", "external_students_higher_education", "half_board_students_higher_education", "boarding_students_higher_education")
student_housing <- rename_df(Hebergement_eleves_etablissements_2d, housing_names)

We create a session variable, as year_back_to_school refers to the beginning of the school year and not the exam session.
student_housing <- student_housing %>%
mutate(session = year_back_to_school + 1) %>%
select(year_back_to_school, session, everything()) # just to order the variables

We again create department_fr and remove the departments outside mainland France.

student_housing$department_fr <- stri_trans_general(student_housing$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
student_housing <- student_housing %>%
dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>%
dplyr::filter(!department_fr == "NA")

student_housing <- student_housing %>%
select(-c(contains("higher"))) # keep only secondary-education variables

student_housing <- student_housing %>%
mutate(external_students_rate = external_students_secondary_education/students_secondary_education*100,
half_boarders_students_rate = half_boarders_students_secondary_education/students_secondary_education*100,
boarding_students_rate = boarding_students_secondary_education/students_secondary_education*100)

housing_dep <- student_housing %>%
select(session, region, students_secondary_education:boarding_students_rate) %>%
group_by(session, department_fr) %>%
summarise(external_students_secondary_education = sum(external_students_secondary_education, na.rm = TRUE),
half_boarders_students_secondary_education = sum(half_boarders_students_secondary_education, na.rm = TRUE),
boarding_students_secondary_education = sum(boarding_students_secondary_education, na.rm = TRUE),
students_secondary_education = sum(students_secondary_education, na.rm = TRUE),
external_students_rate = mean(external_students_rate, na.rm = TRUE),
half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE),
boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE))

We can see the summarized data set housing_dep below.
sg_parent_names <- c("geocode", "department", "session","sing_par")
single_parent <- rename_df(insee_rp_hist_xxxx, sg_parent_names)

single_parent[["session"]] <- as.double(single_parent[["session"]])
single_parent[["sing_par"]] <- as.double(single_parent[["sing_par"]])

We create department_fr and remove the departments outside of mainland France.

single_parent$department_fr <- stri_trans_general(single_parent$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
single_parent <- single_parent %>%
dplyr::filter(!department_fr %in% c("Polynesie-Francaise","Guyane", "Martinique", "Guadeloupe", "La-Reunion", "Mayotte", "Saint-Martin", "-")) %>%
dplyr::filter(!department_fr == "NA")

We can see the final table single_parent below.
covid_names <- c("department_code", "test_week", "educational_level", "age_group", "pop", "positive", "tested", "incidence_rate", "positivity_rate", "screening_rate")

covid_in_schools <- rename_df(covid_sp_dep_heb_cage_scol_2022_11_30_19h01, covid_names)

covid_in_schools[["positive"]] <- as.double(gsub(",",".", covid_in_schools[["positive"]]))
covid_in_schools[["incidence_rate"]] <- as.double(gsub(",",".", covid_in_schools[["incidence_rate"]]))
covid_in_schools[["positivity_rate"]] <- as.double(gsub(",",".", covid_in_schools[["positivity_rate"]]))

We create a variable test_date categorizing each week. We chose the first day of
the week: as we had the week number, for each year we add the corresponding number of weeks to the Monday of the first week of that year. The second variable is the session. A session runs from August to July of the next year. As our rule is based on the month, some tests done in the first days of August may count towards the “wrong” session. However, the number of COVID-19 cases in August is relatively low compared to the rest of the year, and this represents at most 6 days of tests. We therefore consider this margin
of error to be satisfactory.

covid_in_schools <- covid_in_schools %>%
mutate(test_date = case_when (as.numeric(substr(test_week, 1,4))== 2020
~ lubridate::ymd('2019-12-30') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
as.numeric(substr(test_week, 1,4))== 2021
~ lubridate::ymd('2021-01-04') + lubridate::weeks(as.numeric(substr(test_week, 7,8))),
as.numeric(substr(test_week, 1,4))== 2022
~ lubridate::ymd('2022-01-03') + lubridate::weeks(as.numeric(substr(test_week, 7,8)))),
session = case_when(month(test_date) <= 7 ~ year(test_date),
month(test_date) > 7 ~ year(test_date)+1))
We add department_fr and region, using the department_fr and region variables from dnb_results. We join the two data sets by department_code. To do this, we first need to match the two codes by removing the first character of the department_code from dnb_results.

reg_department <- dnb_results %>%
select(c("department_code", "department_fr", "region")) %>%
unique()
reg_department$department_code <- substring(reg_department$department_code, 2)
covid_in_schools <- right_join(x = covid_in_schools, y = reg_department, by = "department_code")

covid_in_schools <- covid_in_schools %>%
filter(educational_level == "[11-15)")

We can see the final table covid_in_schools below.
We will use the ggplot France map for our visualizations.
map <- map_data("france")

The region variable in this map is in fact the department. We rename it department_fr to fit with the other data sets.

colnames(map)[5] <- "department_fr"

To explore this data set, we decided to start at the national level to analyse the global tendency. We will then move to the regional level to compare the number of students and see which regions perform better, digging deeper into the success rate and the graduation rate for each mention. To complete our analysis, we will look at the results of the best and worst performing establishments in 2020 and compare them with their 2006 results.
We created a data set summarizing the results at the national level. We simplified our visualisation process by pivoting the table. During the process, we created four new variables: Candidates, Number_of_students, Mention_type and Rate.
France_results <- dnb_results %>%
group_by(session) %>%
summarise(registered = sum(registered),
present = sum(present),
admitted = sum(admitted),
admitted_without = sum(admitted_without),
admitted_AB = sum(admitted_AB),
admitted_B = sum(admitted_B),
admitted_TB = sum(admitted_TB),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>%
pivot_longer(c(registered, present,contains("admitted")),
names_to = "Candidates",
values_to = "Number_of_students") %>%
pivot_longer(c(contains("pct")),
names_to = "Mention_type",
values_to = "Rate")

For this first graph, we plotted the Number_of_students
and grouped them by Candidates.
p <- France_results %>%
ggplot(aes(x = session, y = Number_of_students, group = Candidates, color = Candidates))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
ggtitle("National DNB statistics") +
theme_ipsum() +
ylab("Number of students")
ggplotly(p, tooltip = c("x" ,"y"))

According to the graph, the number of admissions has increased over the years: 704’742 students were admitted in 2021, compared to only 572’236 in 2006. We notice, though, a slight decrease in 2018 and 2019; the positive trend then resumed and peaked in 2020 with 716’237 admissions. The number of registrations follows the same tendency: 729’803 students registered in 2006 compared to 801’721 in 2021, an increase of 9.85% and the highest figure recorded so far. Attendance also set a record in 2021, reaching 795’209 students.
For the analysis of the success rate and the rate of achievement of honours, we plotted the Rate and grouped it by Mention_type.
p <- France_results %>%
ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
ggtitle("National DNB statistics") +
theme_ipsum() +
ylab("Rate in %")
ggplotly(p, tooltip = c("x" ,"y"))

The number of students admitted without a mention has been decreasing since 2006: 284’774 students (53.85%) in 2006, compared to 154’095 students (25.11%) in 2021. This downward trend slowed over the 2017-2021 period. As is the case for all mentions, there is a complete shift in 2017, which will be analysed in further detail in part one of our analysis. Contrary to the students graduating without honours, the number of students admitted with the TB distinction shows a positive pattern, growing from 31’563 (4.49%) in 2006 to 204’523 (23.98%) in 2021. The patterns for the two other distinctions are relatively similar: they remained stable up until 2017, but the change was not as drastic as for the first two distinctions mentioned. The number of students admitted with the B mention also shows a positive pattern, rising from 91’719 students (13.66%) in 2006 to 180’774 students (25.07%) in 2021. The number of students admitted with the AB distinction increased slightly from 2006 to 2021, from 164’180 students (28%) to 165’350 students (25.83%), a growth of 0.71%.
We started with a comparative analysis of the success rate between the regions. For this, we created the charts below based on the dnb_results data set.
We created this bar plot to visualize the average success rate of each French region. We used group_by to group the success rate data by region: we took success_rate_pct and region, computed the average success rate for each region, and plotted it in the graph below.
p <- dnb_results %>%
select(success_rate_pct, region, department) %>%
group_by(region) %>%
summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>%
ggplot(aes(x = region,
y = success_rate,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
labs(title = "Average success rate by region, 2006-2021", y = "Rate in %")+
theme_ipsum() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))

The success rate follows a positive pattern: it has been increasing since 2006 in all French regions, which have on average performed similarly over the years. However, Bretagne took the lead with a success rate of 89.8%, followed by Nouvelle-Aquitaine with 86.6% and, in third position, Auvergne-Rhône-Alpes with 86.4%.
To get a clearer idea of which regions performed best, we created the following bar plots, based on summarized data, to visualize the regions that have recorded the highest number of admissions so far. Students’ distinctions are once again taken into consideration, so that we can see which regions recorded the highest number of admissions for each distinction up to 2021.
p <- dnb_results %>%
select(admitted, region) %>%
group_by(region) %>%
summarise(admitted = sum(admitted)) %>%
ggplot(aes(x = region,
y = admitted,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Number of admitted by region") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_without, region) %>%
group_by(region) %>%
summarise(zero = sum(admitted_without)) %>%
ggplot(aes(x = region,
y = zero,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with zero mention") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_AB, region) %>%
group_by(region) %>%
summarise(AB = sum(admitted_AB)) %>%
ggplot(aes(x = region,
y = AB,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention AB") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_B, region) %>%
group_by(region) %>%
summarise(B = sum(admitted_B)) %>%
ggplot(aes(x = region,
y = B,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention B") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))
p <- dnb_results %>%
select(admitted_TB, region) %>%
group_by(region) %>%
summarise(TB = sum(admitted_TB)) %>%
ggplot(aes(x = region,
y = TB,
fill = region)) +
geom_col() +
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention TB") +
ylab("Number of students")+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
ggplotly(p, tooltip = c("x" ,"y"))

Based on the charts, Ile de France has recorded the highest number of admissions since 2006, with 1 914 458 students admitted in this region so far. Auvergne-Rhône-Alpes comes second with 1 0276 655 admissions, and Hauts-de-France third with 990 610 admissions since 2006. Grand Est, Nouvelle-Aquitaine and Occitanie have performed similarly over the years, each recording around 85 000 admissions over the period 2006-2021. On the contrary, Corse has recorded the lowest number of admissions: only 41 541 students were admitted in this region over the same interval.
Taking the distinctions into consideration, we notice that the regions with the highest total number of admissions are also those with the highest numbers for every distinction. Ile de France leads for the highest honours (TB), high honours (B), honours (AB) and standard passes (without) alike.

These results are not surprising: Ile de France is the most populated French region, with approximately 12 262 544 inhabitants, so we expected it to lead in terms of number of admissions.
In order to take the time component into account, a line chart has been created: we used the group_by function to group our success rate data by session and region. We relied on this chart to analyze the evolution of the success rate over the years for each region. It also enabled us to take the admission distinctions into consideration and thus analyze the development of students’ performance in all French regions.
It is to be noted that for a better visual interpretation the scale of the y-axis is adapted to each graph.
p <- dnb_results %>%
select(success_rate_pct, region, session) %>%
group_by(region, session) %>%
summarise(success_rate = mean(success_rate_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = success_rate,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Success rate") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))
p <- dnb_results %>%
select(without_pct, region, session) %>%
group_by(region, session) %>%
summarise(zero = mean(without_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = zero,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with zero mention") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))
p <- dnb_results %>%
select(AB_pct, region, session) %>%
group_by(region, session) %>%
summarise(AB = mean(AB_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = AB,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention AB") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))
p <- dnb_results %>%
select(B_pct, region, session) %>%
group_by(region, session) %>%
summarise(B = mean(B_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = B,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention B") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))
p <- dnb_results %>%
select(TB_pct, region, session) %>%
group_by(region, session) %>%
summarise(TB = mean(TB_pct, na.rm = TRUE)) %>%
ggplot(aes(x = session,
y = TB,
color = region,
text = region)) +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_ipsum() +
ggtitle("Admitted with mention TB") +
ylab("Rate in %")
ggplotly(p, tooltip = c("text","x" ,"y" ))

The rate of standard pass admissions follows a negative pattern: it has been decreasing since 2006 in all regions. It dropped from 57.7% in 2006 to 24.9% in 2021 for Ile de France, from 49.6% to 24.4% for Nouvelle-Aquitaine and from 53.4% to 22.6% for Auvergne-Rhône-Alpes.
In 2017 and 2020, the rate of standard pass admissions as well as the rate of admissions with honours dropped remarkably, hitting their lowest point in 2020 in all regions. This tendency was disrupted in 2019, when both rates increased noticeably, and in 2021 the increase reappeared, reflecting once again a deterioration of students’ performance.
Conversely, the rate of students succeeding with a great distinction (high and highest honours) has been increasing in all French regions since 2006, reaching its highest point in 2020, apart from a marked dip in 2019 that recurred in 2021.
Following our national and regional analysis, we wanted to narrow the scope of our analysis further in order to gain a better understanding of our data. Hence, we processed our data at the departmental level. To do so, we created the following box plots based on the dnb_results dataset.
The first step of this analysis is to visualize the performance of all French departments over the period 2006-2021. To do so, the following box plot has been created: we grouped our data by department and session, then plotted the success rate of all French departments over the period 2006-2021.
dnb_pct_dep <- dnb_results %>%
group_by(department, session) %>%
summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
B_pct_dep = mean(B_pct, na.rm = TRUE),
TB_pct_dep = mean(TB_pct, na.rm = TRUE),
without_pct_dep = mean(without_pct, na.rm = TRUE),
success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = success_rate_pct_dep,
group = session,
fill = session,
text = department,
text2 = success_rate_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "", y = "Success rate in %",
title ="Success rate of each Department by session")
ggplotly(p, tooltip = c("text","text2"))
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = without_pct_dep,
group = session,
fill = session,
text = department,
text2 = without_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with no mention in %",
title ="Rate of students with no mention of each Department by session")
ggplotly(p, tooltip = c("text","text2"))
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = B_pct_dep,
group = session,
fill = session,
text = department,
text2 = B_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with mention Bien in %",
title ="Rate of students with mention Bien of each Department by session")
ggplotly(p, tooltip = c("text","text2"))
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = AB_pct_dep,
group = session,
fill = session,
text = department,
text2 = AB_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with mention Assez Bien in %",
title ="Rate of students with mention Assez Bien of each Department by session")
ggplotly(p, tooltip = c("text","text2"))
p <- dnb_pct_dep %>%
ggplot(aes(x = session,
y = TB_pct_dep,
group = session,
fill = session,
text = department,
text2 = TB_pct_dep)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_gradientn(colors = viridis(16))+
guides(fill = "none")+
labs( x= "",
y = "Rate of students with mention Très Bien in %",
title ="Rate of students with mention Très Bien of each Department by session")
ggplotly(p, tooltip = c("text","text2"))

Based on our data analysis, the success rate follows a positive pattern: it has been increasing since 2006. We notice an improvement in students’ success rate in 2017 as well as a remarkable decline in 2019. A boost in performance was recorded in 2021, when the success rate in all departments reached its highest level since 2006.
Also, the rate of students succeeding with honours or with a standard pass has been decreasing since 2006 in all French departments. It decreased markedly in 2017 compared to the previous years, reaching its lowest level in 2020. This downward trend was interrupted in 2021, when these rates started to increase again.
In the same vein, the rate of students graduating with high or highest honours has been increasing since 2006 in all French departments. It rose particularly sharply in 2017 and reached its highest point in 2020, which reflects a noticeable improvement in students’ performance.
The following box plots were created to compare the performance of the best- and worst-performing departments. They allowed us to visualize the difference between the highest-achieving department in 2020, Paris, and one of the least-achieving ones, Eure-et-Loir. We chose to display the results of each establishment to show the dispersion of results within a department.
We plotted the different success rates of the Paris department for the years 2006 and 2020 using the filter function. We proceeded the same way for the Eure-et-Loir department over the same period.
Paris <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
filter(department == "PARIS", session == "2020") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Paris %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Parisian establishments in 2020")
ggplotly(p, tooltip = c("text","text2"))
Paris <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
filter(department == "PARIS", session == "2006") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Paris$Mention_type <- factor(Paris$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Paris %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Parisian establishments in 2006")
ggplotly(p, tooltip = c("text","text2"))
Eure <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
dplyr::filter(department == "EURE-ET-LOIR", session == "2020") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Eure %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Eure et Loir establishments in 2020")
ggplotly(p, tooltip = c("text","text2"))
Eure <- dnb_results %>%
select(school_id,establishment_name,department,session, contains("pct")) %>%
dplyr::filter(department == "EURE-ET-LOIR", session == "2006") %>%
pivot_longer(c(contains("pct")), # pivot longer to allow for a clean and easy boxplot graph with each pct
names_to = "Mention_type",
values_to = "Rate")
Eure$Mention_type <- factor(Eure$Mention_type, levels = c("success_rate_pct","without_pct", "AB_pct", "B_pct", "TB_pct")) #creation of factor to order the graph
p <- Eure %>%
ggplot(aes(x = Mention_type,
y = Rate,
fill = Mention_type,
text = establishment_name,
text2 = Rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
guides(fill = "none")+
scale_fill_viridis(discrete = TRUE) +
theme_ipsum() +
labs( x= "",
y = "Rate in %",
title ="Results for Eure et Loir establishments in 2006")
ggplotly(p, tooltip = c("text","text2"))

Paris
On average, the success rate across all Paris establishments reached 93% in 2020, the highest performance recorded since 2006. Accordingly, the number of admissions with highest honours outpaced the number of admissions with any other distinction.
Conversely, the success rate was much lower in 2006: it ranged roughly between 40% and 60%, compared with 66% to 100% in 2020. In that year, students succeeding with a standard pass were the most numerous, while admissions with highest honours represented the smallest percentage.
Eure-et-Loir
In 2020, the results of the department of Eure-et-Loir followed the same pattern as those of Paris: the success rate reached its peak in 2020, with an average of 87.90%, compared to only 81% in 2006. Similarly, admissions with highest honours outpaced successes with any other distinction in 2020.
However, in 2006, students graduating with a standard pass outnumbered those graduating with any other distinction.
Overall, in 2020, the Paris department outperformed Eure-et-Loir, with an average success rate of 92.40% compared to only 87.90%. Likewise, the rate of admissions with highest honours in Paris, roughly 29.17% to 55.21%, largely outstripped that of Eure-et-Loir, roughly 14.52% to 33.53%. Also, the rate of standard pass admissions was remarkably higher in Eure-et-Loir: it represented 24.44% of admissions in 2020, compared to only 12.97% in Paris. Similarly, Eure-et-Loir recorded an average rate of admissions with honours of 22.86%, compared to only 17.59% in Paris. We see that the dispersion pattern across establishments is rather similar between the two departments but changes between 2006 and 2020.
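The department-level averages discussed above can be reproduced with a short dplyr summary. This is only a sketch, assuming the same dnb_results columns used elsewhere in this report (department, session, success_rate_pct, without_pct, TB_pct):

```r
# Sketch: compare the 2020 averages of the two departments discussed above.
library(dplyr)

paris_vs_eure <- dnb_results %>%
  filter(session == "2020",
         department %in% c("PARIS", "EURE-ET-LOIR")) %>%
  group_by(department) %>%
  summarise(success_rate = mean(success_rate_pct, na.rm = TRUE),  # average success rate
            without      = mean(without_pct, na.rm = TRUE),       # standard pass rate
            TB           = mean(TB_pct, na.rm = TRUE))            # highest honours rate
paris_vs_eure
```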
p <- ggplot() +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") +
geom_point(data = establishment_24, aes(x = longitude , y = latitude, text = department_fr, text2 = establishment), size = 0.5)+
coord_map() +
labs(x = "",
y = "",
title = 'Establishments labelled "Génération 2024"') +
map_theme
ggplotly(p, tooltip = c("text", "text2"))

establishment_24_dep_map <- left_join(x = map[,-6], y = establishment_24_dep, by = "department_fr")
p <- establishment_24_dep_map %>%
filter(session_started == 2017) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "Establishments newly labelled Génération 2024 in 2017")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2018) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "Establishments newly labelled Génération 2024 in 2018")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2019) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "Establishments newly labelled Génération 2024 in 2019")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2020) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "Establishments newly labelled Génération 2024 in 2020")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2021) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "Establishments newly labelled Génération 2024 in 2021")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-establishment_24_dep_map %>%
filter(session_started == 2022) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(data = map, aes(long,lat, group = group), fill = "white", color = "grey") + # here to add the outline of all departments
geom_polygon(aes(fill= establishment), color = "black") +
scale_fill_viridis(name = "Number of establishment", limits = c(0, 50)) +
labs(x = "",
y = "",
title = "Establishments newly labelled Génération 2024 in 2022")+
map_theme
ggplotly(p, tooltip = c("text","fill"))

# We first need to count the number of establishments per department
nb_est <- dnb_results %>%
select(school_id, department_fr) %>%
group_by(department_fr) %>%
summarise(n_distinct(school_id))
# We include the values with the one from establishment_24
establishment_24_dnb_dep <- left_join(x = nb_est, y = establishment_24_dep)
# We measure the rate of labelled establishment
establishment_24_dnb_dep <- establishment_24_dnb_dep %>%
group_by(department_fr) %>%
summarise(establishment = sum(establishment),
nb_est = mean(`n_distinct(school_id)`)) %>% # use mean to keep the value of n_distinct(school_id)
mutate(rate = establishment/nb_est*100)
establishment_24_dnb_dep_map <- left_join(x = map[,-6], y = establishment_24_dnb_dep, by = "department_fr")
p <-establishment_24_dnb_dep_map %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= rate), color = "black") +
scale_fill_viridis(name = "Rate of labelled establishment in % ") +
labs(x = "",
y = "",
title = "Rate of labelled establishments in 2022")+
map_theme
ggplotly(p, tooltip = c("text","fill"))

p <- student_housing %>%
select(session, external_students_secondary_education, half_boarders_students_secondary_education, boarding_students_secondary_education) %>%
group_by(session) %>%
summarise(external_students_secondary_education = sum(external_students_secondary_education, na.rm = TRUE),
half_boarders_students_secondary_education = sum(half_boarders_students_secondary_education, na.rm = TRUE),
boarding_students_secondary_education = sum(boarding_students_secondary_education, na.rm = TRUE)) %>%
gather("Housing_option", "students", 2:4) %>%
ggplot(aes(x=session,
y=students,
fill=Housing_option)) +
geom_bar(stat="identity", position=position_dodge())+
ggtitle("Students per housing option in France")+
xlab("Session")+
ylab("Number of students")+
scale_fill_viridis(discrete = TRUE)+
theme_ipsum()
ggplotly(p, tooltip = "y")
p <- housing_dep %>%
ggplot(aes(x = as.factor(session),
y = external_students_rate,
group = session,
fill = session,
text = department_fr,
text2 = external_students_rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_viridis()+
guides(fill = "none")+
labs( x= "",
y = "Rate of external students in %",
title ="Rate of external students of each Department by session")+
theme_ipsum()
ggplotly(p, tooltip = c("text","text2"))
p <- housing_dep %>%
ggplot(aes(x = as.factor(session),
y = half_boarders_students_rate,
group = session,
fill = session,
text = department_fr,
text2 = half_boarders_students_rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_viridis()+
guides(fill = "none")+
labs( x= "",
y = "Rate of half-boarders students in %",
title ="Rate of half-boarders students of each Department by session")+
theme_ipsum()
ggplotly(p, tooltip = c("text","text2"))
p <- housing_dep %>%
ggplot(aes(x = as.factor(session),
y = boarding_students_rate,
group = session,
fill = session,
text = department_fr,
text2 = boarding_students_rate)) +
geom_boxplot()+
geom_jitter(width = 0.25, alpha = 0.5)+
scale_fill_viridis()+
guides(fill = "none")+
labs( x= "",
y = "Rate of boarding students in %",
title ="Rate of boarding students of each Department by session")+
theme_ipsum()
ggplotly(p, tooltip = c("text","text2"))

The first visualisation we wanted was a barplot showing the evolution of the total number of single-parent families in France from 2007 to 2018. To do this, we isolated, in a dataset called sp1_, the sing_par and session variables, then summed the number of single parents with sum. To remove the years that did not interest us, we used the filter function. We were then able to use geom_col.
sp1_ <- single_parent %>%
select(c("session", "sing_par")) %>%
group_by(session)%>%
summarise (sing_par = sum(sing_par, na.rm = TRUE))
p <- ggplot(data = sp1_, aes(x = as.factor(session),
y = sing_par,
fill = as.factor (session),
text = sing_par))+
geom_col()+
scale_fill_viridis(discrete = TRUE, "Session") +
labs(x = "", y = "Number of single parent families")+
theme_ipsum()
ggplotly(p, tooltip = "text")

Based on the barplot, we can make an overall analysis of the evolution of single-parent families in France over the years. Although several years are missing, we can clearly see that the number of single-parent families has kept increasing since 2007. Single-parent families numbered 2,427,110 in 2007, versus 3,031,823 in 2018: an increase of roughly 25% in just over ten years.
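As a quick back-of-the-envelope check, the relative increase between the two totals can be computed directly:

```r
# Relative increase in single-parent families between 2007 and 2018
n_2007 <- 2427110
n_2018 <- 3031823
increase_pct <- (n_2018 - n_2007) / n_2007 * 100
round(increase_pct, 1)  # ~24.9
```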
We then create the dataset needed for the maps: we join the single_parent and map datasets using department_fr.
jmap_sp <- left_join(x = map[,-6], y = single_parent, by = "department_fr")

p <- jmap_sp %>%
filter(session == 2007) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= sing_par), color = "black") +
scale_fill_viridis(name = "Number of single parent families") +
labs(x = "",
y = "",
title = "Single parent families in 2007")+
map_theme
ggplotly(p, tooltip = c("text","fill"))

From the map, we realise that the departments with the most single-parent families are those with the most inhabitants (Bien dans ma ville, 2022). This unsurprising result influenced our choice to create rate variables so that the data become easily comparable.
To do this, we first join single_parent and dnb_results_dep by department_fr and session to obtain the number of students per department. We then use the mutate function to create a new variable that divides sing_par by the number of students admitted per department. Thanks to this, we were able to create maps of France showing the number of single-parent families per student admitted from 2007 to 2018. To make the maps interactive, we used the ggplotly function.
sing_dnb <- left_join(x = single_parent, y = dnb_results_dep, by = c("department_fr", "session"))
sing_dnb <- sing_dnb %>%
mutate(single_parent_per_student_admitted = sing_par/admitted)
#Join with the map data set for the mapping.
sing_dnb_map <- left_join(x = map[,-6], y = sing_dnb, by = "department_fr")
p <-sing_dnb_map %>%
filter(session == 2007) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student admitted in 2007")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2008) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2008")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2012) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2012")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2013) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2013")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2017) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2017")+
map_theme
ggplotly(p, tooltip = c("text","fill"))
p <-sing_dnb_map %>%
filter(session == 2018) %>%
ggplot(aes(x = long,
y = lat,
group = group,
text = department_fr)) +
geom_polygon(aes(fill= single_parent_per_student_admitted), color = "black") +
scale_fill_viridis(name = "Single parent families per student", limits = c(0, 8)) +
labs(x = "",
y = "",
title = "Single parent families per student in 2018")+
map_theme
ggplotly(p, tooltip = c("text","fill"))

It can be noted that the departments in the west (Mayenne, Maine-et-Loire, Vendée) have the lowest single-parent rates, while the departments of the Ile-de-France region (Paris, Seine-Saint-Denis, Marne), those around the Mediterranean Sea (Bouches-du-Rhône, Pyrénées, Hérault) and those in Corsica (Haute-Corse, Corse-du-Sud) have the highest rates. It will therefore be important to keep an eye on these departments when analyzing the results of the Diplôme National du Brevet (DNB).
First, we want an overview of positive COVID-19 cases in France over the pandemic period. We select from covid_in_schools the variables positive and test_date, group by test_date and summarise positive with sum. We then use ggplot with test_date on the x-axis and positive on the y-axis.
p <- covid_in_schools %>%
select(positive, test_date) %>%
group_by(test_date) %>%
summarise(positive = sum(positive)) %>%
ggplot( mapping = aes(x= test_date, y = positive)) +
geom_line() +
labs(title = "French Covid-19 cases for the age group 11 to 15 years old (2020-2022)", x = "Date", y = "Number of cases")+
theme_ipsum()
ggplotly(p, tooltip = c("x","y"))

From 2020 until the end of 2021, we notice waves of COVID-19 cases: roughly every six months, the number of positive cases rises to nearly 20 000 per day and then drops to around 1 000 per day. However, we observe a sharp increase, with a peak of more than 235 000 positive cases per day, in January 2022. After this spike, the number of positive cases fell significantly and remained around 2 000 to 10 000 per day. The 2022 academic year is therefore a year to keep an eye on when comparing with the DNB results.
For the second exploratory analysis, we want to compare positive COVID-19 cases by region. As the population differs between departments, it is easier to compare using incidence_rate. To create the chart, we select from the covid_in_schools dataset the variables department_code, incidence_rate, session, region, department_fr and test_date, group by region and test_date, and summarise incidence_rate with its mean and positive with its sum. We also use ggplotly with tooltip to display the important information.
p <- covid_in_schools %>%
select(c("department_code", "incidence_rate", "session", "region", "department_fr", "test_date", "positive")) %>%
group_by(region, test_date) %>%
summarise(incidence_rate = mean(incidence_rate, na.rm = TRUE),
positive = sum(positive)) %>%
ggplot() +
geom_line(mapping = aes(x = test_date, y = incidence_rate, color = region))+
scale_color_viridis(discrete = TRUE) +
labs(title = "Covid-19 incidence rate for the age group 11 to 15 years old by region (2020-2022)", x = "Date", y = "Incidence rate") +
theme_ipsum()
ggplotly(p, tooltip = c("x","y", "color"))

For the age group 11 to 15, Ile-de-France and Auvergne-Rhône-Alpes seem to be the two regions most affected by the first waves. COVID-19 did not spare any region, as their incidence rates closely follow each other. The only slight exception is the wave of March-April 2022, during which a few regions such as Corse and Bretagne were more affected.
The last visualization for the covid dataset is a map of the average incidence rate by department for the sessions 2020 to 2023.
covidpos_dep <- covid_in_schools %>%
select(c("department_code", "incidence_rate", "session", "department_fr" )) %>%
group_by(department_fr, session) %>%
summarise(incidence_rate = mean(incidence_rate, na.rm = TRUE))
covidpos_dep <- left_join(x = map[,-6], y = covidpos_dep)

It is to be noted that, for a better visual interpretation, the colour scale is adapted to each map.
p <- covidpos_dep %>%
filter(session == 2020) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2020")+
map_theme
ggplotly(p)
p <- covidpos_dep %>%
filter(session == 2021) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2021")+
map_theme
ggplotly(p)
p <- covidpos_dep %>%
filter(session == 2022) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2022")+
map_theme
ggplotly(p)
p <- covidpos_dep %>%
filter(session == 2023) %>%
ggplot( aes(x= long, y= lat, group=group, text = department_fr)) +
geom_polygon(aes(fill= incidence_rate), color = "black") +
coord_map()+
scale_fill_viridis(name = "Incidence rate average in 2023")+
map_theme
ggplotly(p)
During the 2020 session, the average incidence rate was relatively low in all departments of France, except in Mayenne, which had the highest rate at 17.96.
During the 2021 session, the scale indicates a clear increase in the average incidence rate in every department, ranging from 52.6 to 246.9 cases per department. The western part of France is much more affected than the eastern part. We can hypothesise that this pattern may be linked to proximity to borders and neighbouring countries.
During the 2022 session, we can observe an overall increase in the incidence rate in each of the departments. Aveyron, Landes, Pyrénées-Atlantiques and Corse du Sud are the departments with the highest rates, with a maximum of 1201.
During the 2023 session, the incidence rate fell significantly, particularly in Ile-de-France; this is because the school year was not over yet at the time of writing. Creuse is the department with the highest rate, at 420.
p <- France_results %>%
ggplot(aes(x = session, y = Number_of_students, group = Candidates, color = Candidates))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
ggtitle("National DNB statistics") +
theme_ipsum() +
ylab("Number of students")
ggplotly(p, tooltip = c("x" ,"y"))
We first mapped the establishments labelled Generation 2024 onto a map of the average success rate for the brevet over the period 2006-2021. This visualisation helped us determine whether a clear link could be made or whether further analysis was needed. To create the map, we first had to calculate the average success rate and join the result with the map data set.
result <- dnb_results %>%
select(department_fr, success_rate_pct) %>%
group_by(department_fr) %>%
summarise(success_rate = mean(success_rate_pct, na.rm = TRUE))
result_map <- left_join(x = map[,-6], y = result)
Then, with the help of ggplot, we superimposed the map of the average success rate and the dotted map of the locations of the labelled establishments presented in chapter 4.2.
p <- ggplot() +
geom_polygon(data = result_map, aes(long,lat,
group = group,
fill = success_rate,
text = department_fr)) +
geom_point(data = establishment_24, aes(x = longitude, y = latitude, text2= establishment), size = 0.5)+
coord_map() +
scale_fill_viridis(name = "Average success rate, 2006-2021")+
labs(x = "",
y = "",
title = "Average success rate vs labelled establishments") +
map_theme
ggplotly(p, tooltip = c("text2", "text")) # due to a strange behavior of ggplotly we must add the "text" variable in the tooltip, otherwise the success rate map is only partially displayed
Zooming in on the map, one can see that there is no clear pattern linking high-achieving departments with a great number of labelled establishments. For example, there are many labelled establishments around Paris, yet the success rate there varies greatly. Building on this, we do not expect a strong relation between the success rate and whether or not an establishment is labelled.
To test this statement, we ran a linear regression at the establishment level. We used the Generation 2024 label as a boolean, with 1 for labelled establishments and 0 for unlabelled ones. To create this variable, we first had to add the school_id and session_started variables from establishment_24 to dnb_results. We used left_join by the variable school_id to keep all the establishments present in dnb_results. To add the variable est_24, we used mutate coupled with case_when to cover all cases. est_24 takes the value 1 when session is greater than or equal to session_started, and the value 0 in all other cases, i.e. when session is smaller than session_started, or when the establishment is not labelled and its session_started value is NA.
est_24_join <- establishment_24 %>%
select(school_id, session_started)
est_24_dnb <- left_join( x = dnb_results, y = est_24_join, by = c("school_id"))
est_24_dnb <- est_24_dnb %>%
filter(session >= 2017) %>%
mutate(est_24 = case_when(session >= session_started ~ 1,
session < session_started ~ 0,
is.na(session_started) == TRUE ~ 0 ))
To visualise the possible relation between the success rate and the labelling, we used geom_bin_2d with est_24 as a factor on the x-axis and success_rate_pct on the y-axis. We chose this type of plot to display the dispersion of the establishments along the y-axis more clearly. For the linear regression we used the lm function.
ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = success_rate_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
lm_est_24_dnb <- lm(data = est_24_dnb, success_rate_pct ~ est_24)
tab_model(lm_est_24_dnb)
| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 88.09 | 88.00 – 88.19 | <0.001 |
| est 24 | -0.60 | -1.00 – -0.19 | 0.004 |
| Observations | 42472 | ||
| R2 / R2 adjusted | 0.000 / 0.000 | ||
The result of this linear regression suggests that holding the Generation 2024 label could have a slightly negative association with the success rate. However, the R squared being essentially zero makes this analysis highly unreliable, as the model explains less than 0.01% of the variation in success rate.
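As a sanity check, the R squared reported by tab_model can be recomputed directly from the fitted model. A minimal sketch on simulated data (since est_24_dnb is built earlier in the report, toy values stand in for it here):

```r
# Minimal sketch on simulated data: extract R squared from an lm fit.
# The variable names mirror the report, but the data are toy values.
set.seed(42)
toy <- data.frame(est_24 = rbinom(1000, 1, 0.2))
toy$success_rate_pct <- 88 - 0.6 * toy$est_24 + rnorm(1000, sd = 8)
fit <- lm(success_rate_pct ~ est_24, data = toy)
r2 <- summary(fit)$r.squared  # the same quantity tab_model reports as R2
stopifnot(r2 >= 0, r2 <= 1)   # R squared is bounded in [0, 1]
```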
We decided to continue our analysis and performed a linear regression for each distinction, following the same steps as for the success rate regression.
ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = without_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
lm_est_24_dnb <- lm(data = est_24_dnb, without_pct ~ est_24)
tab_model(lm_est_24_dnb)
| without_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 25.44 | 25.31 – 25.57 | <0.001 |
| est 24 | -1.21 | -1.77 – -0.65 | <0.001 |
| Observations | 42458 | ||
| R2 / R2 adjusted | 0.000 / 0.000 | ||
ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = AB_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
lm_est_24_dnb <- lm(data = est_24_dnb, AB_pct ~ est_24)
tab_model(lm_est_24_dnb)
| AB_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 25.68 | 25.58 – 25.77 | <0.001 |
| est 24 | -2.30 | -2.70 – -1.89 | <0.001 |
| Observations | 42458 | ||
| R2 / R2 adjusted | 0.003 / 0.003 | ||
ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = B_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
lm_est_24_dnb <- lm(data = est_24_dnb, B_pct ~ est_24)
tab_model(lm_est_24_dnb)
| B_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 25.27 | 25.18 – 25.36 | <0.001 |
| est 24 | -0.43 | -0.81 – -0.05 | 0.027 |
| Observations | 42458 | ||
| R2 / R2 adjusted | 0.000 / 0.000 | ||
ggplot(data = est_24_dnb )+
geom_bin_2d( aes( x = as.factor(est_24), y = TB_pct))+
labs(x = "est_24")+
theme_ipsum()+
scale_fill_viridis("Number of establishment")
lm_est_24_dnb <- lm(data = est_24_dnb, TB_pct ~ est_24)
tab_model(lm_est_24_dnb)
| TB_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 23.61 | 23.47 – 23.75 | <0.001 |
| est 24 | 3.94 | 3.34 – 4.53 | <0.001 |
| Observations | 42458 | ||
| R2 / R2 adjusted | 0.004 / 0.004 | ||
The R squared is unsurprisingly very low again, meaning that we cannot conclude any strong relation between students' achievements and the Generation 2024 label. It is still worth mentioning that the label is positively associated with the TB attribution rate and negatively with the rest.
We can conclude from our analysis that we do not see any pattern suggesting the Generation 2024 label has an influence on the results of the establishments. This is not that surprising, as the objective of the label is really to promote the 2024 Olympics in Paris and link the school world to the sporting one. It sets no requirement on the amount of sport done during school time. The total amount of sport practised by students is therefore probably not much greater in a labelled institution than in a non-labelled one, hence the little influence.
To analyse the impact of the housing offerings of the establishments on the results of their students, we first need to join the student_housing data set with dnb_results. We filtered out the unnecessary or repetitive variables of dnb_results, such as the academy name and code or the education sector. We used inner_join by session, school_id and department_fr to keep only the establishments appearing in student_housing.
dnb_prep_housing <- dnb_results %>%
select(session, school_id, registered:TB_pct)
housing_dnb <- inner_join(x = student_housing , y = dnb_prep_housing, by = c("session", "school_id", "department_fr"))
Note that students_secondary_education and registered do not match, as students from all four years of collège are accounted for in students_secondary_education.
We evaluated the possibility of doing a multiple linear regression by measuring the correlation between the external_students_rate, half_boarders_students_rate and boarding_students_rate variables.
corrplot(cor(housing_dnb[c(17:19)]), col = viridis(256))
The correlation is unsurprisingly very high: if, for example, an establishment offers meals, most of the students will take them and very few will eat at home. Therefore, we will use single linear regressions to assess whether there exists a link between student results and their habitual place of eating and living.
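Part of this correlation is structural: if the three rates sum to 100% for each establishment (an assumption about how the variables were built), any one of them is a linear combination of the other two, which rules out estimating all three jointly. A toy sketch, not the project's data:

```r
# Toy sketch: three shares that sum to 100 are perfectly linearly
# dependent, so a multiple regression using all three cannot identify
# the coefficients separately. Simulated values only.
set.seed(1)
external   <- runif(10, 40, 80)
half_board <- runif(10, 10, 40)
boarding   <- 100 - external - half_board  # fully determined by the other two
stopifnot(all(abs(external + half_board + boarding - 100) < 1e-9))
```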
We simply used the lm function for each offering. We did not include scatter plots in the final report, as the geom_point graphs were completely overcrowded.
lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ external_students_rate)
tab_model(lm_housing_dnb)
| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 91.92 | 91.72 – 92.12 | <0.001 |
| external students rate | -0.10 | -0.10 – -0.09 | <0.001 |
| Observations | 15803 | ||
| R2 / R2 adjusted | 0.070 / 0.070 | ||
lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ half_boarders_students_rate)
tab_model(lm_housing_dnb)
| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 82.04 | 81.64 – 82.43 | <0.001 |
| half boarders students rate | 0.10 | 0.10 – 0.11 | <0.001 |
| Observations | 15803 | ||
| R2 / R2 adjusted | 0.084 / 0.084 | ||
lm_housing_dnb <- lm(data = housing_dnb, success_rate_pct ~ boarding_students_rate)
tab_model(lm_housing_dnb)
| success_rate_pct | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 89.43 | 89.28 – 89.57 | <0.001 |
| boarding students rate | -0.06 | -0.07 – -0.05 | <0.001 |
| Observations | 15803 | ||
| R2 / R2 adjusted | 0.005 / 0.005 | ||
The p-value is excellent for each regression but, as in the previous regressions, the R squared is very low. However, we see some interesting variation in the intercept, a difference of about 10 percentage points, that could prove insightful. To dig deeper into these variations, we performed a cluster analysis.
We performed the cluster analysis on the 2021 and 2020 sessions, and the results are very similar. We decided to keep both analyses, but going through one is sufficient to see the method used and understand the results.
To perform the cluster analysis, we filtered the 2021 session and selected the variables department_fr, session, school_id, external_students_rate, half_boarders_students_rate and boarding_students_rate from the joined data set. We then removed the first three and scaled the rest of the data.
The goal of this cluster analysis is to define clusters relative to the offering of each establishment; that is, to split the establishments by offering. As we have 3 different offerings, it would not make sense to use too many clusters. To define the right number of clusters for our kmeans clustering, we used the elbow method and concluded that we need 4 clusters. The ratio of the between-cluster sum of squares to the total sum of squares is good, at 81.5%.
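The quality ratio quoted above can be read directly from the kmeans object. A minimal sketch on simulated data (in the report, km.res is fitted on the scaled housing variables):

```r
# Toy sketch: the between_SS / total_SS ratio reported by kmeans,
# computed here on simulated data rather than the project's.
set.seed(123)
toy <- scale(matrix(c(rnorm(50, 0), rnorm(50, 5),
                      rnorm(50, 10), rnorm(50, 15)), ncol = 2))
km <- kmeans(toy, centers = 4, nstart = 25)
ratio <- km$betweenss / km$totss  # the "81.5%"-style quality measure
stopifnot(ratio > 0, ratio < 1)
```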
fviz_nbclust(clust, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) + #add line for better visualisation
labs(subtitle = "Elbow method") #We can determine the optimal number of cluster, 4 clusters seems to be reasonable
# Compute k-means
km.res <- kmeans(clust, 4, nstart = 25)
# Visualize clusters using factoextra
fviz_cluster(km.res, clust,
ggtheme = theme_ipsum(),
repel = TRUE)+
scale_fill_viridis(discrete = TRUE)+
scale_color_viridis(discrete = TRUE)
We can see that the four clusters are well defined, with clusters one, two and four differentiated along the x-axis and cluster three spanning wider on both the x-axis and the y-axis. One can see what the dimensions represent in the graph below. Clusters one and two have a higher half-boarder rate than cluster three and especially cluster four. Cluster four represents establishments without any catering offering for the students. Cluster three consists of establishments with at least some of their students in boarding school.
fviz_pca_var(PCA(clust, graph = FALSE))
To continue our analysis on the clusters we just computed, we need to bring the data into the main data set. We gathered the cluster data in a new data frame, added the cluster column to the data set used for the clustering, and then joined it with inner_join to the main data set to keep the filter we had applied.
clust1 <- tibble(department_fr = names(km.res$cluster),
cluster = km.res$cluster)
clust_housing_dnb$cluster <- clust1$cluster
housing_dnb_2021 <- inner_join(x = housing_dnb, y = clust_housing_dnb)
To analyse the data by cluster, we summarised it by cluster. You can see the results in the table below, with an added variable establishment, which is the number of establishments present in each cluster.
clust_h_dnb_2021 <- housing_dnb_2021 %>%
group_by(cluster) %>%
summarise(establishment = n(),
external_students_rate = mean(external_students_rate, na.rm = TRUE),
half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE),
boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>%
round(digits = 2)
datatable(clust_h_dnb_2021, options =list(scrollX = "300px"))
We visually represented the results in a spider chart using the radarchart function.
op <- par(mar=c(0, 0, 0, 0))
radar <- clust_h_dnb_2021 %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= viridis(4, alpha = 1) , pfcol= viridis(4, alpha = 0.2) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4)#, title = "Radar Graph")
# Add a legend
legend(x=1.2, y=-0.4, legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"), bty = "n", pch=20 , col=viridis(4) , text.col = "grey", cex=0.4, pt.cex=1)
par(op)
For better clarity, we have also displayed each cluster individually.
# we need to set the margin and create two rows to display the graphs
op <- par(mar=c(0, 1, 1, 0),mfrow=c(2, 2))
##### Cluster 1
#filter cluster 1
radar <- clust_h_dnb_2021 %>%
filter(cluster == 1)%>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.267, green = 0.00392, blue = 0.329, alpha = 0.5) , pfcol= rgb( red = 0.267, green = 0.00392, blue = 0.329, alpha = 0.2) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 1", cex.main = 1 )
##### Cluster 2
radar <- clust_h_dnb_2021 %>%
filter(cluster == 2) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.192 , green = 0.40784, blue = 0.557, alpha = 0.5) , pfcol= rgb( red = 0.192 , green = 0.40784, blue = 0.557, alpha = 0.5) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 2", cex.main = 1 )
##### Cluster 3
radar <- clust_h_dnb_2021 %>%
filter(cluster == 3) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.208 , green = 0.71765, blue = 0.475, alpha = 0.5) , pfcol= rgb( red = 0.208 , green = 0.71765, blue = 0.475, alpha = 0.5) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 3", cex.main = 1 )
##### Cluster 4
radar <- clust_h_dnb_2021 %>%
filter(cluster == 4) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145, alpha = 0.5), pfcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145, alpha = 0.5), plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 4", cex.main = 1 )
par(op)
The cluster analysis shows very interesting results. Our intuition from the regression analysis is confirmed, as the success rate varies between the four clusters. The highest-achieving establishments are the ones offering a canteen for lunch, where the children stay at school with their friends. The lowest success rate belongs to the group going home for lunch, where the children are separated and individual circumstances differ greatly. Indeed, some students will have a comforting environment at home, whereas others will have a hard time at home or at someone else's home. The difference is also quite marked for some of the distinctions. The rates for the distinctions AB and B are rather similar across clusters. Students from establishments in the fourth cluster achieve the best results, with 29.3% of them receiving the distinction TB and only 20.8% not receiving any distinction. The rates are much worse if the school has a boarding offer, as 35% of students do not receive any distinction and only 10% get the TB distinction. This could be because some of these are establishments for elite athletes, who increasingly focus on their sports.
To perform the same cluster analysis for 2020, we filtered the 2020 session and selected the variables department_fr, session, school_id, external_students_rate, half_boarders_students_rate and boarding_students_rate from the joined data set. We then removed the first three and scaled the rest of the data.
The goal of this cluster analysis is again to define clusters relative to the offering of each establishment and split the establishments by offering. As we have 3 different offerings, it would not make sense to use too many clusters. To define the right number of clusters for our kmeans clustering, we used the elbow method. For 2020, the decision is not as clear as for 2021, and the hesitation is between three and four clusters. As we have four clusters in 2021, we decided to also use four clusters for the 2020 analysis. The ratio of the between-cluster sum of squares to the total sum of squares is good, at 81.1%.
fviz_nbclust(clust, kmeans, method = "wss") +
geom_vline(xintercept = 4, linetype = 2) + #add line for better visualisation
labs(subtitle = "Elbow method") #We can determine the optimal number of cluster, 4 clusters seems to be reasonable
# Compute k-means
km.res <- kmeans(clust, 4, nstart = 25)
# Visualize clusters using factoextra
fviz_cluster(km.res, clust,
ggtheme = theme_ipsum(),
repel = TRUE)+
scale_fill_viridis(discrete = TRUE)+
scale_color_viridis(discrete = TRUE)
We can see that the four clusters are well defined, with clusters one, three and four differentiated along the x-axis and cluster two spanning wider on both the x-axis and the y-axis. One can see what the dimensions represent in the graph below. Clusters three and four have a higher half-boarder rate than cluster two and especially cluster one. Cluster one represents establishments without any catering offering for the students. Cluster two consists of establishments with at least some of their students in boarding school.
fviz_pca_var(PCA(clust, graph = FALSE))
To continue our analysis on the clusters we just computed, we need to bring the data into the main data set. We gathered the cluster data in a new data frame, added the cluster column to the data set used for the clustering, and then joined it with inner_join to the main data set to keep the filter we had applied.
clust1 <- tibble(department_fr = names(km.res$cluster),
cluster = km.res$cluster)
clust_housing_dnb$cluster <- clust1$cluster
housing_dnb_2020 <- inner_join(x = housing_dnb, y = clust_housing_dnb)
To analyse the data by cluster, we summarised it by cluster. You can see the results in the table below, with an added variable establishment, which is the number of establishments present in each cluster.
clust_h_dnb_2020 <- housing_dnb_2020 %>%
group_by(cluster) %>%
summarise(establishment = n(),
external_students_rate = mean(external_students_rate, na.rm = TRUE),
half_boarders_students_rate = mean(half_boarders_students_rate, na.rm = TRUE),
boarding_students_rate = mean(boarding_students_rate, na.rm = TRUE),
without_pct = mean(without_pct, na.rm = TRUE),
AB_pct = mean(AB_pct, na.rm = TRUE),
B_pct = mean(B_pct, na.rm = TRUE),
TB_pct = mean(TB_pct, na.rm = TRUE),
success_rate_pct = mean(success_rate_pct, na.rm = TRUE)) %>%
round(digits = 2)
datatable(clust_h_dnb_2020, options =list(scrollX = "300px"))
We visually represented the results in a spider chart using the radarchart function.
op <- par(mar=c(0, 0, 0, 0))
radar <- clust_h_dnb_2020 %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= viridis(4, alpha = 1) , pfcol= viridis(4, alpha = 0.2) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4)#, title = "Radar Graph")
# Add a legend
legend(x=1.2, y=-0.4, legend = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"), bty = "n", pch=20 , col=viridis(4) , text.col = "grey", cex=0.4, pt.cex=1)
par(op)
For better clarity, we have also displayed each cluster individually.
# we need to set the margin and create two rows to display the graphs
op <- par(mar=c(0, 1, 1, 0),mfrow=c(2, 2))
##### Cluster 1
#filter cluster 1
radar <- clust_h_dnb_2020 %>%
filter(cluster == 1)%>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.267, green = 0.00392, blue = 0.329, alpha = 0.5) , pfcol= rgb( red = 0.267, green = 0.00392, blue = 0.329, alpha = 0.2) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 1", cex.main = 1 )
##### Cluster 2
radar <- clust_h_dnb_2020 %>%
filter(cluster == 2) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.192 , green = 0.40784, blue = 0.557, alpha = 0.5) , pfcol= rgb( red = 0.192 , green = 0.40784, blue = 0.557, alpha = 0.5) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 2", cex.main = 1 )
##### Cluster 3
radar <- clust_h_dnb_2020 %>%
filter(cluster == 3) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.208 , green = 0.71765, blue = 0.475, alpha = 0.5) , pfcol= rgb( red = 0.208 , green = 0.71765, blue = 0.475, alpha = 0.5) , plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 3", cex.main = 1 )
##### Cluster 4
radar <- clust_h_dnb_2020 %>%
filter(cluster == 4) %>%
select(-c(cluster, establishment))
# To use the fmsb package, I have to add 2 lines to the dataframe: the max and min of each variable to show on the plot!
radar <- rbind(rep(100,50) , rep(0,50) , radar)
# plot the radar chart with the right color from viridis
radarchart(radar, axistype=1 ,
#custom polygon
pcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145, alpha = 0.5), pfcol= rgb( red = 0.992 , green = 0.90588, blue = 0.145, alpha = 0.5), plwd=1 , plty=1,
#custom the grid
cglcol="grey", cglty=1, axislabcol="grey", caxislabels=seq(0,100,25), cglwd=0.4, calcex = 0.4,
#custom labels
vlcex=0.4, title = "Cluster 4", cex.main = 1 )
par(op)
The cluster analysis again shows very interesting results. Our intuition from the regression analysis is confirmed, as the success rate varies between the four clusters. The highest-achieving establishments are the ones offering a canteen for lunch, where the children stay at school with their friends. The lowest success rate belongs to the group going home for lunch, where the children are separated and individual circumstances differ greatly. Indeed, some students will have a comforting environment at home, whereas others will have a hard time at home or at someone else's home. The difference is also quite marked for some of the distinctions. The rates for the distinctions AB and B are rather similar across clusters. Students from establishments in the first cluster achieve the best results, with 33.3% of them receiving the distinction TB and only 17.9% not receiving any distinction. The rates are much worse if the school has a boarding offer, as 26.4% of students do not receive any distinction and only 14.7% get the TB distinction. This could be because some of these are establishments for elite athletes, who increasingly focus on their sports.
From this analysis, we can conclude that having lunch and sleeping at school or at home does have a small influence on the results of the Diplôme National du Brevet. Better results are achieved by students in establishments offering lunch. Going home for lunch does seem to hinder academic success, and boarding schools might not be the best choice for students aiming for top honours.
The first analysis is at the national level. We want to use two plot types: a bar plot representing the total number of single parent families and a line plot representing the success rate. Both plots use session for the x-axis. Since the range of the success_rate plot is much lower than that of the single parent families plot, a secondary y-axis with an adapted scale is used.
sp <- single_parent%>%
select("session", "department_fr", "sing_par")
singpar_vs_dnb <- left_join(dnb_results_dep, sp, by = c("department_fr","session"))
sp_vs_dnb <- singpar_vs_dnb%>%
select(success_rate_pct, session, sing_par, department_fr)%>%
group_by(session)%>%
summarise(success_rate_pct = mean(success_rate_pct),
sing_par = sum (sing_par, na.rm = TRUE)) %>%
ggplot()+
geom_col(aes(x = session, y = sing_par)) +
geom_line(aes(x = session, y = 30000*success_rate_pct), size = 1, color="blue", group = 1) +
scale_y_continuous(sec.axis = sec_axis(~./30000, name = "Success rate")) +
labs( x = "Session", y = "Single parent families" )+
theme_ipsum()
sp_vs_dnb
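The secondary-axis trick above relies on one scaling factor applied twice: the series is multiplied by 30000 for plotting, and sec_axis divides the axis labels by the same factor so they read in the original units. A minimal sketch of the arithmetic:

```r
# Toy check of the secondary-axis scaling used above: the plotted value
# and the displayed label must invert each other exactly.
scale_factor <- 30000
success_rate <- 86.14                        # a success rate in percent
plotted      <- scale_factor * success_rate  # value drawn on the primary axis
label        <- plotted / scale_factor       # what sec_axis(~./30000) displays
stopifnot(isTRUE(all.equal(label, success_rate)))
```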
Despite the constant increase in the number of single parent families over time, the DNB results do not appear to be much influenced by it. Logically, the more the number of single parent families increases, the more the success rate should decrease. However, from 2007 to 2017, sing_par and success_rate both increased, and from 2017 to 2019, although sing_par continued to increase, the success_rate dropped from 88.74% to 86.14%.
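One quick way to quantify this visual impression is the correlation between the two aggregated yearly series. A toy sketch mirroring the national-level comparison (simulated values, not the report's singpar_vs_dnb data):

```r
# Toy sketch: correlation between two yearly series shaped like the
# national-level plot (rising single-parent counts, success rate rising
# then dipping). Simulated values only.
session      <- 2007:2019
sing_par     <- seq(1.50e6, 1.85e6, length.out = 13)             # steadily rising
success_rate <- c(seq(81, 88.74, length.out = 11), 87.5, 86.14)  # rises then dips
r <- cor(sing_par, success_rate)
# Positive overall, against the "more single parents, lower success" intuition.
stopifnot(r > 0)
```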
We now want to analyse single parent families against the DNB results at the departmental level. We will therefore fit a linear regression using singpar_vs_dnb, a data frame containing the DNB results and sing_par per department_fr.
ggplot(data = singpar_vs_dnb,
mapping = aes(x = sing_par, y = success_rate_pct)) +
labs(title = "Single-parent families VS success rate", x = "sing_par", y = "Success rate", ) +
geom_point() +
geom_smooth(method = lm,
color = "blue",
size = 0.3)+
theme_ipsum()
#> Warning: Removed 960 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 960 rows containing missing values (`geom_point()`).
regression_sp <- lm(data = singpar_vs_dnb, success_rate_pct ~ sing_par)
tab_model(regression_sp)

| success_rate_pct |  |  |  |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 85.68 | 85.24 – 86.12 | <0.001 |
| sing par | -0.00 | -0.00 – -0.00 | 0.020 |
| Observations | 576 |  |  |
| R2 / R2 adjusted | 0.009 / 0.008 |  |  |
The 34 missing values are the overseas departments, for which we did not have the success_rate information. R-squared equals 0.009 and the p-value equals 0.020: although this is nominally below the 5% threshold, the model explains less than 1% of the variance and therefore does not fit our data. We conclude that the success rate of a department does not depend on the number of single-parent families in the department.
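This gap between statistical significance and explanatory power can be illustrated with a toy example outside the R workflow. The sketch below (Python, entirely made-up numbers loosely mimicking our 576 departmental observations) fits a least-squares line to data with a real but negligible slope: the effect exists, yet R-squared stays near zero.

```python
import random

def ols_r_squared(x, y):
    """R-squared of a simple least-squares fit y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

random.seed(42)
# hypothetical data: single-parent families per department vs success rate
x = [random.uniform(0, 50000) for _ in range(576)]
y = [85.7 - 2e-6 * xi + random.gauss(0, 2) for xi in x]  # tiny true slope plus noise
r2 = ols_r_squared(x, y)
print(round(r2, 3))  # near 0: the predictor explains almost none of the variance
```

With n = 576, even such a negligible slope can reach a small p-value, which is why we judge the fit on R-squared rather than on the p-value alone.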
In order to observe the influence of Covid on student performance, we investigate the relationship between the incidence rate and the DNB success rate in every department in France. To do so, we first join covid_in_schools and dnb_results (an inner join on session and department) to create a new data frame called covid_vs_dnb, and then fit a simple linear regression model.
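The aggregate-then-join logic can be sketched in miniature (Python, with hypothetical departments): each table is first averaged per (department, session) key, and the inner join then keeps only the keys present in both tables, which is how departments lacking Covid data drop out.

```python
# hypothetical raw values per (department, session) key
dnb = {("Ain", "2020"): [88.0, 90.0], ("Mayotte", "2020"): [75.0]}
covid = {("Ain", "2020"): [120.0]}  # no incidence data for Mayotte

# group_by + summarise: one mean per key
dnb_avg = {k: sum(v) / len(v) for k, v in dnb.items()}
covid_avg = {k: sum(v) / len(v) for k, v in covid.items()}

# inner join on the shared key: unmatched departments are dropped
joined = {k: (dnb_avg[k], covid_avg[k]) for k in dnb_avg if k in covid_avg}
print(joined)  # {('Ain', '2020'): (89.0, 120.0)}
```

This is also why the pooled model below has 192 observations: 96 matched departments times two sessions.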
c <- c("2020", "2021")  # sessions for which we have covid data
d <- dnb_results %>%
  select("session", "department_fr", "success_rate_pct") %>%
  filter(session %in% c) %>%
  group_by(department_fr, session) %>%
  summarise(success_rate_pct = mean(success_rate_pct))
co <- covid_in_schools %>%
  select("session", "department_fr", "incidence_rate") %>%
  group_by(department_fr, session) %>%
  summarise(incidence_rate = mean(incidence_rate))
covid_vs_dnb <- inner_join(d, co, by = c("session", "department_fr"))

ggplot(data = covid_vs_dnb,
       mapping = aes(x = incidence_rate, y = success_rate_pct)) +
  labs(title = "Incidence rate of Covid cases VS success rate", x = "Incidence rate", y = "Success rate") +
  geom_point() +
  geom_smooth(method = "lm",
              color = "blue",
              size = 0.3)
regression <- lm(data = covid_vs_dnb, success_rate_pct ~ incidence_rate)
tab_model(regression)

| success_rate_pct |  |  |  |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 90.12 | 89.71 – 90.53 | <0.001 |
| incidence rate | -0.01 | -0.02 – -0.01 | <0.001 |
| Observations | 192 |  |  |
| R2 / R2 adjusted | 0.220 / 0.216 |  |  |
According to the residuals, the regression is more precise for high values than for low ones. The p-value is close to 0 and therefore significant, so the model seems usable. However, R-squared is about 0.22, meaning that the incidence rate explains only about 22% of the variation in the success rate. The result is therefore not very conclusive.
We then fitted a separate linear regression for 2020 and for 2021, because we thought that the low 2020 values could have distorted our model.
covid_vs_dnb2020 <- inner_join(d, co, by = c("session", "department_fr")) %>%
  filter(session == 2020)
ggplot(data = covid_vs_dnb2020,
       mapping = aes(x = incidence_rate, y = success_rate_pct)) +
  labs(title = "Incidence rate of Covid cases VS success rate (2020)", x = "Incidence rate", y = "Success rate") +
  geom_point() +
  geom_smooth(method = "lm",
              color = "blue",
              size = 0.3)
regression2020 <- lm(data = covid_vs_dnb2020, success_rate_pct ~ incidence_rate)
tab_model(regression2020)

| success_rate_pct |  |  |  |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 90.12 | 89.60 – 90.64 | <0.001 |
| incidence rate | 0.01 | -0.17 – 0.20 | 0.880 |
| Observations | 96 |  |  |
| R2 / R2 adjusted | 0.000 / -0.010 |  |  |
covid_vs_dnb2021 <- inner_join(d, co, by = c("session", "department_fr")) %>%
  filter(session == 2021)
ggplot(data = covid_vs_dnb2021,
       mapping = aes(x = incidence_rate, y = success_rate_pct)) +
  labs(title = "Incidence rate of Covid cases VS success rate (2021)", x = "Incidence rate", y = "Success rate") +
  geom_point() +
  geom_smooth(method = "lm",
              color = "blue",
              size = 0.3)
regression2021 <- lm(data = covid_vs_dnb2021, success_rate_pct ~ incidence_rate)
tab_model(regression2021)

| success_rate_pct |  |  |  |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 89.51 | 88.00 – 91.02 | <0.001 |
| incidence rate | -0.01 | -0.02 – -0.00 | 0.046 |
| Observations | 96 |  |  |
| R2 / R2 adjusted | 0.042 / 0.031 |  |  |
From these regression results, we cannot use the models to explain the influence of the incidence rate on the success rate. Consequently, Covid does not seem to have had a big impact on the DNB results. This finding is not so surprising given the implementation of continuous assessment: according to the article by l'Etudiant (Cojean, 2020), the French government took special measures for the DNB exams following Covid, and since 2020 the diploma has been awarded on the basis of continuous assessment, with no final tests.
Since our linear regressions were inconclusive, the next step is to observe whether Covid influenced the attribution of distinctions. It can be hypothesized that, with continuous assessment, grading was more lenient. This would lead to better results, and thus more students admitted or awarded a distinction. Since the 2020 results did not vary much, we will mainly focus on the attribution of distinctions in 2021. To do so, we create covid_vs_dnbmention.
m <- dnb_results %>%
  select("session", "department_fr", "TB_pct", "B_pct", "AB_pct", "admitted_without") %>%
  filter(session == 2021) %>%
  group_by(department_fr, session) %>%
  summarise(TB_pct = mean(TB_pct), B_pct = mean(B_pct), AB_pct = mean(AB_pct),
            admitted_without = mean(admitted_without))
co <- covid_in_schools %>%
  select("session", "department_fr", "incidence_rate") %>%
  group_by(department_fr, session) %>%
  summarise(incidence_rate = mean(incidence_rate))
covid_vs_dnbmention <- inner_join(m, co, by = c("session", "department_fr"))

ggplot(data = covid_vs_dnbmention,
       mapping = aes(x = incidence_rate, y = TB_pct)) +
  labs(title = 'Incidence rate of Covid cases VS Distinction "Très Bien"', x = "Incidence rate", y = "Percentage of TB") +
  geom_point() +
  geom_smooth(method = "lm",
              color = "blue",
              size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, TB_pct ~ incidence_rate)
tab_model(regression)

| TB_pct |  |  |  |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 20.21 | 17.70 – 22.73 | <0.001 |
| incidence rate | 0.02 | 0.00 – 0.04 | 0.016 |
| Observations | 94 |  |  |
| R2 / R2 adjusted | 0.062 / 0.052 |  |  |
ggplot(data = covid_vs_dnbmention,
       mapping = aes(x = incidence_rate, y = B_pct)) +
  labs(title = 'Incidence rate of Covid cases VS Distinction "Bien"', x = "Incidence rate", y = "Percentage of B") +
  geom_point() +
  geom_smooth(method = "lm",
              color = "blue",
              size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, B_pct ~ incidence_rate)
tab_model(regression)

| B_pct |  |  |  |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 25.68 | 24.36 – 27.00 | <0.001 |
| incidence rate | -0.00 | -0.01 – 0.00 | 0.281 |
| Observations | 94 |  |  |
| R2 / R2 adjusted | 0.013 / 0.002 |  |  |
ggplot(data = covid_vs_dnbmention,
       mapping = aes(x = incidence_rate, y = AB_pct)) +
  labs(title = 'Incidence rate of Covid cases VS Distinction "Assez Bien"', x = "Incidence rate", y = "Percentage of AB") +
  geom_point() +
  geom_smooth(method = "lm",
              color = "blue",
              size = 0.3)
#> Warning: Removed 2 rows containing non-finite values
#> (`stat_smooth()`).
#> Warning: Removed 2 rows containing missing values (`geom_point()`).
regression <- lm(data = covid_vs_dnbmention, AB_pct ~ incidence_rate)
tab_model(regression)

| AB_pct |  |  |  |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 28.59 | 27.14 – 30.05 | <0.001 |
| incidence rate | -0.02 | -0.03 – -0.01 | 0.002 |
| Observations | 94 |  |  |
| R2 / R2 adjusted | 0.099 / 0.089 |  |  |
ggplot(data = covid_vs_dnbmention,
       mapping = aes(x = incidence_rate, y = admitted_without)) +
  labs(title = "Incidence rate of Covid cases VS Admitted Without Distinction", x = "Incidence rate", y = "Percentage of students admitted without distinction") +
  geom_point() +
  geom_smooth(method = "lm",
              color = "blue",
              size = 0.3)
regression <- lm(data = covid_vs_dnbmention, admitted_without ~ incidence_rate )
tab_model(regression)

| admitted_without |  |  |  |
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 9.87 | 6.89 – 12.85 | <0.001 |
| incidence rate | 0.05 | 0.03 – 0.07 | <0.001 |
| Observations | 96 |  |  |
| R2 / R2 adjusted | 0.204 / 0.196 |  |  |
In view of the various linear regressions, it seems that as Covid cases increased, the academic authorities granted the highest distinction, as well as admission without distinction, more easily: the regressions for the TB distinction and for admitted_without show weak positive associations. This suggests that grading was more lenient in the departments where Covid was most present. Nevertheless, these results must be interpreted with caution: even where the coefficients are statistically significant, all the R-squared values are very low.
Since the linear regressions do not allow us to say much about the influence of Covid on student success, we decided to run a cluster analysis and see whether we can get any interesting results out of it.
cluster1 <- covid_vs_dnb2021
cluster <- cluster1[-c(1, 2)]  # keep only the numeric columns
row.names(cluster) <- as.vector(t(cluster1[, 1]))  # departments as row labels
cluster <- scale(cluster)
fviz_nbclust(cluster, kmeans, method = "wss") +
  geom_vline(xintercept = 5, linetype = 2) +
  labs(subtitle = "Elbow method")
km.res1 <- kmeans(cluster, 5, nstart = 25)  # between_SS / total_SS = 77.4%
fviz_cluster(km.res1, cluster,
             ggtheme = theme_ipsum(),
             repel = TRUE) +
  scale_fill_viridis(discrete = TRUE) +
  scale_color_viridis(discrete = TRUE)
We see on the plot that the success rates do not differ much between the clusters despite the differences in incidence rate. So, consistent with the linear regressions, Covid cases did not seem to affect the DNB results.
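The elbow heuristic used above can be sketched with a toy k-means in plain Python (made-up 2-D points standing in for our scaled departments): the total within-cluster sum of squares (WSS) drops sharply up to the "right" number of clusters, then levels off.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, total within-cluster SS)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
            groups[j].append(p)
        centers = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else c
                   for g, c in zip(groups, centers)]
    wss = sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers) for p in points)
    return centers, wss

def best_kmeans(points, k, nstart=10):
    # like nstart = 25 in R's kmeans: keep the best of several random starts
    return min((kmeans(points, k, seed=s) for s in range(nstart)), key=lambda r: r[1])

rng = random.Random(1)
points = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)] +
          [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(50)])  # two clear blobs
wss_by_k = [best_kmeans(points, k)[1] for k in (1, 2, 3)]
# the drop from k=1 to k=2 is large, the one from k=2 to k=3 is small: elbow at k=2
```

The elbow plot for our departments reads the same way, except the flattening only becomes clear around five clusters.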
p <- clust_cov_dnb %>%
filter(cluster == 1) %>%
ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
labs(title = "Cluster 1", x = "Session", y = "Rate in %" ) +
theme_ipsum()
ggplotly(p, tooltip = c("x", "y"))

p <- clust_cov_dnb %>%
filter(cluster == 3) %>%
ggplot(aes(x = session, y = Rate, group = Mention_type, color = Mention_type))+
geom_line()+
scale_color_viridis(discrete = TRUE) +
labs(title = "Cluster 3", x = "Session", y = "Rate in %" ) +
theme_ipsum()
ggplotly(p, tooltip = c("x", "y"))

Regression template
#>
#> Call:
#> lm(formula = success_rate_pct ~ external_students_rate + half_boarders_students_rate +
#> boarding_students_rate, data = housing_dnb)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.072 -1.570 0.061 1.523 6.073
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 95.6941 4.6987 20.37 <2e-16 ***
#> external_students_rate -0.1034 0.0437 -2.37 0.019 *
#> half_boarders_students_rate -0.0576 0.0532 -1.08 0.280
#> boarding_students_rate NA NA NA NA
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.26 on 189 degrees of freedom
#> Multiple R-squared: 0.0726, Adjusted R-squared: 0.0628
#> F-statistic: 7.4 on 2 and 189 DF, p-value: 0.000805
#>
#> Call:
#> lm(formula = success_rate_pct ~ external_students_rate, data = housing_dnb)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.039 -1.541 0.048 1.501 7.130
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 90.6304 0.4484 202.11 < 2e-16 ***
#> external_students_rate -0.0594 0.0161 -3.69 0.00029 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.26 on 190 degrees of freedom
#> Multiple R-squared: 0.0669, Adjusted R-squared: 0.062
#> F-statistic: 13.6 on 1 and 190 DF, p-value: 0.000293
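The note "(1 not defined because of singularities)" in the first model above deserves a remark: it is what lm() reports when one predictor is a linear combination of the others. A plausible cause, which we assume rather than verify here, is that the three housing rates sum to 100% for every school, so the design matrix including an intercept is rank-deficient and boarding_students_rate cannot receive a coefficient. A toy check in Python with hypothetical rates:

```python
# toy design matrix [intercept, external, half_boarder, boarding] where the
# three hypothetical rates sum to 100 for each school
rows = [(30.0, 50.0), (10.0, 70.0), (45.0, 40.0), (25.0, 60.0)]
X = [[1.0, e, h, 100.0 - e - h] for e, h in rows]

def rank(m, tol=1e-9):
    """Matrix rank via Gauss-Jordan elimination with partial pivoting."""
    m = [row[:] for row in m]
    r = 0
    for c in range(len(m[0])):
        piv = max(range(r, len(m)), key=lambda i: abs(m[i][c]), default=None)
        if piv is None or abs(m[piv][c]) < tol:
            continue
        m[r], m[piv] = m[piv], m[r]
        for i in range(len(m)):
            if i != r:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

print(rank(X))  # 3, not 4: one column is redundant, so lm() drops one coefficient
```

Dropping any one of the three rates, as in the second model above, removes the singularity.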
Cluster template
#> [1] 0.78
Regression trial
lm1 <- lm(dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct + dnb_results$AB_pct)
summary(lm1)
#>
#> Call:
#> lm(formula = dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct +
#>     dnb_results$AB_pct)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.71e-09 0.00e+00 0.00e+00 0.00e+00 2.55e-11
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.00e+02 1.17e-13 8.53e+14 <2e-16 ***
#> dnb_results$without_pct -1.00e+00 1.26e-15 -7.95e+14 <2e-16 ***
#> dnb_results$B_pct -1.00e+00 2.25e-15 -4.44e+14 <2e-16 ***
#> dnb_results$AB_pct -1.00e+00 1.50e-15 -6.66e+14 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4.74e-12 on 135154 degrees of freedom
#> (29 observations deleted due to missingness)
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 3.04e+29 on 3 and 135154 DF, p-value: <2e-16

The perfect fit is expected: the four percentages sum to 100 by construction, so TB_pct is exactly 100 minus the other three and this regression only recovers an identity.

Cluster trial
dnb_pct_dep <- dnb_results %>%
  group_by(department, session) %>%
  summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
            B_pct_dep = mean(B_pct, na.rm = TRUE),
            TB_pct_dep = mean(TB_pct, na.rm = TRUE),
            without_pct_dep = mean(without_pct, na.rm = TRUE),
            success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))
pairs(dnb_pct_dep[2:6])
# the warning below comes from the non-numeric department column being
# coerced; dist(dnb_pct_dep[2:6]) would avoid it
distance <- dist(dnb_pct_dep)
#> Warning in dist(dnb_pct_dep): NAs introduced by coercion
mydata.hclust <- hclust(distance)
plot(mydata.hclust)
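One caveat on this draft: dist() is applied here to raw, unscaled columns, whereas the Covid clustering above first called scale(). Standardization matters because Euclidean distance is dominated by whichever variable has the largest spread. A small sketch (Python, hypothetical values) of the z-scoring that scale() performs:

```python
def zscore(col):
    """Center to mean 0 and scale to unit (sample) standard deviation."""
    n = len(col)
    m = sum(col) / n
    sd = (sum((v - m) ** 2 for v in col) / (n - 1)) ** 0.5
    return [(v - m) / sd for v in col]

incidence = [50.0, 300.0, 800.0, 1200.0]  # hypothetical incidence rates
success = [92.0, 90.5, 88.0, 86.0]        # hypothetical success rates in %

z_inc, z_suc = zscore(incidence), zscore(success)
# both variables now have mean 0 and unit variance, so distances weight them equally
```

Without this step, the incidence rate (spread over hundreds of units) would swamp the success rate (spread over a few percentage points) in the distance matrix fed to hclust().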